-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A Response and Response model #61
Conversation
Simplified the response variable and the model.
|
||
cdef class Response: | ||
cdef float _reward | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @troiwill. What is the significance of wrapping a float inside Response
? Why does this change need to happen in basic.pyx
? This affects all other programs currently using pomdp-py
. This would not be acceptable.
""" | ||
A Response class that only handles a scalar reward. Subclasses of Response can add | ||
more (scalar or vector) variables. But the subclasses must implement how to handle | ||
arithmetic and comparison operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The idea of vector reward is interesting, and I value the contribution. But currently the changes are not backwards compatible. I suggest the following changes:
- Create a file
pomdp_py/framework/generalization.pyx
in which:- Define a class called
ResponseAgent
that inheritsAgent
but it takes in a response model instead of a reward model. - Define
Response
,ResponseModel
and other necessary components there. - Define a
sample_generative_function
specific forResponseAgent
.
- Define a class called
- Create a file
pomdp_py/algorithms/cc_pomcp.pyx
in which you can implement the CC-POMCP algorithm.
That way you don't need to make changes to the existing code using float-based reward.
Please keep the changes to the original code in framework
and algorithms
as little as possible. In addition, you shouldn't need to modify all the examples to achieve your contribution.
Also, please provide a meaningful example that demonstrates CC-POMCP. Currently the test in test_response.py
is rather trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @zkytony, thanks for the notes. Your suggestion is helpful.
Ideally, I would have liked to reuse methods from your PO-UCT and POMCP implementations in the CC-POMCP algorithm. However, I cannot do so via subclassing because PO-UCT and POMCP return a single value in rollout
and simulate
, while CC-POMCP returns two values.
To avoid a backwards compatibility issue, I could reproduce most of the PO-UCT and POMCP code in CC-POMCP. Please let me know how this approach sounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you subclass and override rollout and simulate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@troiwill does that make sense? You can still let CC-POMCP inherit PO-UCT, but override rollout and simulate so they can internally work with multi-valued rewards. You might want to create custom VNode and/or QNode classes, since the value stored in the nodes are also not just floats, I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @zkytony, that makes sense. Once I've made all the changes, I will create another pull request. Thanks for the note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Will close this PR for now.
This pull request is the first stage of implementing the cost-constrained POMCP (CC-POMCP) algorithm. This algorithm cannot be added to the repository directly because it has additional variables and operations not present in the PO-UCT and POMCP algorithms. One example is cost constraints and their corresponding operations.
To accommodate costs and other future variables, I propose to use a generic model, called a Response model, and a corresponding output, called a response. The name "response" comes from the notion of independent and dependent variables, where a response (reward, cost, etc.) depends on the interaction with the real or simulated environment. Thus, a response model is a wrapper for more specific models, such as reward and cost models (and any others that will follow in the future). By extension, a response is a wrapper for the reward, cost, etc.
The pull request has the following:
ResponseModel
andResponse
classes in the basics files,ResponseModel
instead ofRewardModel
directly,tiger
,rocksample
,mos
,load_unload
, andtag
problems,test_response.py
,test_all.py
script,tiger
,rocksample
,mos
,load_unload
, andtag
problems.ResponseModel
andResponse
classes.